Lab 09a: K-means clustering

Introduction

This lab focuses on $K$-means clustering using the Iris flower data set. At the end of the lab, you should be able to:

  • Create a $K$-means clustering model for various cluster sizes.
  • Estimate the right number of clusters to choose by plotting the total inertia of the clusters and finding the "elbow" of the curve.

Getting started

Let's start by importing the packages we'll need. As usual, we'll import pandas for exploratory analysis, but this week we're also going to use the cluster subpackage from scikit-learn to create $K$-means models and the datasets subpackage to access the Iris data set.


In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
from sklearn import cluster
from sklearn import datasets

Next, let's load the data. The iris data set is included in scikit-learn's datasets submodule, so we can just load it directly like this:


In [ ]:
iris = datasets.load_iris()
X = pd.DataFrame({k: v for k, v in zip(iris.feature_names, iris.data.T)})  # Convert the raw data to a data frame
X.head()

Exploratory data analysis

Let's start by making a scatter plot matrix of our data. We can colour the individual scatter points according to their true class labels by passing c=iris.target to the function, like this:


In [ ]:
pd.plotting.scatter_matrix(X, c=iris.target, figsize=(9, 9));

The colours of the data points here are our ground truth, that is, the actual class labels of the data. Generally, when we cluster data, we don't know the ground truth, but in this instance it will help us to assess how well $K$-means clustering segments the data into its true categories.
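
If you want to see exactly what the ground truth looks like, the quick check below (an extra step, not part of the original lab) prints the species names and the integer label assigned to each sample:


In [ ]:
# The species names and the integer label (0, 1 or 2) assigned to each sample
print(iris.target_names)
print(iris.target)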

K-means clustering

Let's build a $K$-means clustering model of the Iris data. scikit-learn supports $K$-means clustering via the cluster subpackage. We can use the KMeans class to build our model.

3 clusters

Generally, we won't know in advance how many clusters to use but, since we do in this instance, let's start by splitting the data into three clusters. We can specify n_clusters=3 to find three clusters, like this:


In [ ]:
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)

Note: In previous weeks, we have called fit(X, y) when fitting scikit-learn estimators. However, in each of these cases, we were fitting supervised learning models where y represented the true class labels of the data. This week, we're fitting $K$-means clustering models, which are unsupervised learners, and so there is no need to specify the true class labels (i.e. y).
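
Now that the model is fitted, you can also inspect the learned cluster centres (centroids) via its cluster_centers_ attribute. This isn't required for the lab, but it's a useful sanity check; the snippet below assumes the k_means model fitted above is still in scope.


In [ ]:
# Each row is the centroid of one cluster, in the same feature order as the columns of X
print(k_means.cluster_centers_)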

When we call the predict method on our fitted estimator, it predicts the class labels for each record in our explanatory data matrix (i.e. X):


In [ ]:
labels = k_means.predict(X)
print(labels)

We can check the results of our clustering visually by building another scatter plot matrix, this time colouring the points according to the cluster labels:


In [ ]:
pd.plotting.scatter_matrix(X, c=labels, figsize=(9, 9));

As can be seen, the $K$-means algorithm has partitioned the data into three distinct sets, using just the values of petal length, petal width, sepal length and sepal width. The clusters do not precisely correspond to the true class labels plotted earlier but, as we usually perform clustering in situations where we don't know the true class labels, this seems like a reasonable attempt.
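
Since we happen to know the true labels here, we can also quantify the overlap. The cross-tabulation below is an extra check (not part of the original lab steps) comparing the cluster assignments against iris.target; note that cluster numbering is arbitrary, so cluster 0 need not correspond to species 0.


In [ ]:
# Rows: true species labels; columns: assigned cluster labels (numbering is arbitrary)
pd.crosstab(pd.Series(iris.target, name="species"), pd.Series(labels, name="cluster"))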

Other numbers of clusters

We can cluster the data into arbitrarily many clusters (up to the point where each sample is its own cluster). Let's cluster the data into two clusters and see what effect this has:


In [ ]:
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X)

labels = k_means.predict(X)
pd.plotting.scatter_matrix(X, c=labels, figsize=(9, 9));

Finding the optimum number of clusters

One way to find the optimum number of clusters is to plot how the total inertia (the within-cluster sum of squared distances to each cluster's centroid) varies as the number of clusters increases. Because the total inertia always decreases as the number of clusters increases, we can determine a reasonable, though not necessarily true, clustering of the data by finding the "elbow" in the curve: the point beyond which adding further clusters yields diminishing returns.

We can access the inertia value of a fitted $K$-means model using its inertia_ attribute, like this:


In [ ]:
clusters = range(1, 10)
inertia = []
for n in clusters:
    k_means = cluster.KMeans(n_clusters=n)
    k_means.fit(X)
    inertia.append(k_means.inertia_)

plt.plot(clusters, inertia)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia");

In this instance, we could choose either two or three clusters to represent the data, as these account for the largest decreases in inertia. As we know that there are three true classes, choosing two would be an incorrect conclusion in this case, but this is an unavoidable consequence of clustering: if we do not know the structure of the data in advance, we always risk choosing a representation of it that does not reflect the ground truth.
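
If you'd like a slightly more objective view of the elbow than eyeballing the curve, one informal heuristic (an extra step, not part of the lab) is to look at the relative drop in inertia from one cluster count to the next, using the clusters and inertia values computed above:


In [ ]:
# Relative decrease in inertia when moving from n to n + 1 clusters
for n, drop in zip(clusters[:-1], -np.diff(inertia) / np.array(inertia[:-1])):
    print("%d -> %d clusters: %.1f%% decrease" % (n, n + 1, 100 * drop))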